Mental Health in Tech Data Analysis
# %pip install python-dotenv pandas numpy nbformat scipy scikit-learn statsmodels sqlalchemy plotly
from dotenv import load_dotenv
load_dotenv()
from sqlalchemy import create_engine
import sqlite3
from math import sqrt
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr, f_oneway, chi2_contingency, spearmanr
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from sklearn.metrics import r2_score
import plotly
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import plotly.subplots as sp
pio.renderers.default = "notebook"
plotly.offline.init_notebook_mode()
from utils.functions import (
clean_gender_data,
clean_age_data,
clean_to_numerical,
database_filtering,
create_bar_plot,
)
cnx = sqlite3.connect("./datasets/mental_health.sqlite")
df = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", cnx)
print(df)
       name
0    Answer
1  Question
2    Survey
answer = pd.read_sql_query("SELECT * FROM Answer", cnx)
question = pd.read_sql_query("SELECT * FROM Question", cnx)
survey = pd.read_sql_query("SELECT * FROM Survey", cnx)
display(answer.head())
display(question.head())
display(survey.head())
| | AnswerText | SurveyID | UserID | QuestionID |
|---|---|---|---|---|
| 0 | 37 | 2014 | 1 | 1 |
| 1 | 44 | 2014 | 2 | 1 |
| 2 | 32 | 2014 | 3 | 1 |
| 3 | 31 | 2014 | 4 | 1 |
| 4 | 31 | 2014 | 5 | 1 |
| | questiontext | questionid |
|---|---|---|
| 0 | What is your age? | 1 |
| 1 | What is your gender? | 2 |
| 2 | What country do you live in? | 3 |
| 3 | If you live in the United States, which state ... | 4 |
| 4 | Are you self-employed? | 5 |
| | SurveyID | Description |
|---|---|---|
| 0 | 2014 | mental health survey for 2014 |
| 1 | 2016 | mental health survey for 2016 |
| 2 | 2017 | mental health survey for 2017 |
| 3 | 2018 | mental health survey for 2018 |
| 4 | 2019 | mental health survey for 2019 |
There are 3 tables:
- Answer: AnswerText, SurveyID, UserID, QuestionID
- Question: questiontext, questionid
- Survey: SurveyID, Description
print("Missing values in Answer table:")
print(answer.isnull().sum())
print("\nMissing values in Question table:")
print(question.isnull().sum())
print("\nMissing values in Survey table:")
print(survey.isnull().sum())
print("\nDuplicates in Answer table:")
print(answer.duplicated().sum())
print("\nDuplicates in Question table:")
print(question.duplicated().sum())
print("\nDuplicates in Survey table:")
print(survey.duplicated().sum())
Missing values in Answer table:
AnswerText    0
SurveyID      0
UserID        0
QuestionID    0
dtype: int64

Missing values in Question table:
questiontext    0
questionid      0
dtype: int64

Missing values in Survey table:
SurveyID       0
Description    0
dtype: int64

Duplicates in Answer table:
0

Duplicates in Question table:
0

Duplicates in Survey table:
0
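Beyond null and duplicate checks, a referential-integrity pass is cheap insurance: every QuestionID used in Answer should resolve to a row in Question. A minimal sketch on toy frames (the real tables would be substituted in):

```python
import pandas as pd

# Toy stand-ins for the Answer and Question tables.
answer = pd.DataFrame({"QuestionID": [1, 2, 2, 5]})
question = pd.DataFrame({"questionid": [1, 2, 5]})

# Any QuestionID without a matching questionid is an orphaned answer.
orphans = set(answer["QuestionID"]) - set(question["questionid"])
print(orphans)  # set() -> every answer maps to a known question
assert not orphans
```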
1. Merging Tables:
The three tables are joined on their shared keys: QuestionID (referred to as questionid in the Question table) and SurveyID. The merged result is then sorted from the most general level (SurveyID), through QuestionID, down to the most detailed data (UserID).
merged_table = pd.merge(
answer, question, left_on="QuestionID", right_on="questionid", how="outer"
)
merged_table.drop("questionid", axis=1, inplace=True)
merged_table = pd.merge(merged_table, survey, on="SurveyID", how="outer")
merged_table.rename(columns={"questiontext": "QuestionText"}, inplace=True)
merged_table["SurveyID"] = merged_table["SurveyID"].astype(int)
merged_table["QuestionID"] = merged_table["QuestionID"].astype(int)
merged_table["UserID"] = merged_table["UserID"].astype(int)
merged_table.sort_values(by=["SurveyID", "QuestionID", "UserID"], inplace=True)
merged_table = merged_table.reset_index(drop=True)
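Outer merges can silently create rows with missing keys when a QuestionID exists in only one table; pandas' `indicator=True` flag makes such rows visible. A self-contained sketch (toy frames mirror the columns above):

```python
import pandas as pd

# Minimal stand-ins for the Answer and Question tables (assumed shapes).
answer = pd.DataFrame({
    "AnswerText": ["37", "Male"],
    "SurveyID": [2014, 2014],
    "UserID": [1, 1],
    "QuestionID": [1, 2],
})
question = pd.DataFrame({
    "questiontext": ["What is your age?", "What is your gender?"],
    "questionid": [1, 2],
})

# indicator=True tags each row by its source, so unmatched keys
# ("left_only" / "right_only") are easy to spot after an outer merge.
checked = pd.merge(
    answer, question,
    left_on="QuestionID", right_on="questionid",
    how="outer", indicator=True,
)
print(checked["_merge"].value_counts())
```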
2. Preparing the data for display:
processed_dataset = merged_table.copy()
processed_dataset.loc[
processed_dataset["QuestionID"] == 2, "AnswerText"
] = clean_gender_data(
processed_dataset[processed_dataset["QuestionID"] == 2], "AnswerText"
)
processed_dataset.loc[
processed_dataset["QuestionID"] == 1, "AnswerText"
] = clean_age_data(
processed_dataset[processed_dataset["QuestionID"] == 1], "AnswerText"
)
processed_dataset.loc[
processed_dataset["QuestionID"] == 7, "AnswerText"
] = clean_to_numerical(
processed_dataset[processed_dataset["QuestionID"] == 7], "AnswerText"
)
processed_dataset.loc[
processed_dataset["QuestionID"] == 12, "AnswerText"
] = clean_to_numerical(
processed_dataset[processed_dataset["QuestionID"] == 12], "AnswerText"
)
processed_dataset.loc[
processed_dataset["QuestionID"].isin([28, 98]), "AnswerText"
] = clean_to_numerical(
processed_dataset[processed_dataset["QuestionID"].isin([28, 98])], "AnswerText"
)
processed_dataset.loc[
processed_dataset["QuestionID"].isin([56, 102]), "AnswerText"
] = clean_to_numerical(
processed_dataset[processed_dataset["QuestionID"].isin([56, 102])], "AnswerText"
)
processed_dataset.head()
| | AnswerText | SurveyID | UserID | QuestionID | QuestionText | Description |
|---|---|---|---|---|---|---|
| 0 | 37 | 2014 | 1 | 1 | What is your age? | mental health survey for 2014 |
| 1 | 44 | 2014 | 2 | 1 | What is your age? | mental health survey for 2014 |
| 2 | 32 | 2014 | 3 | 1 | What is your age? | mental health survey for 2014 |
| 3 | 31 | 2014 | 4 | 1 | What is your age? | mental health survey for 2014 |
| 4 | 31 | 2014 | 5 | 1 | What is your age? | mental health survey for 2014 |
3. Preparing a new database file for the clean DataFrame:
engine = create_engine("sqlite:///./datasets/mental_health_clean.sqlite")
processed_dataset.to_sql("Answer", engine, if_exists="replace", index=False)
cnx2 = sqlite3.connect("./datasets/mental_health_clean.sqlite")
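After writing the cleaned table, reading it straight back confirms the file is usable. A round-trip sketch against an in-memory SQLite database (the real path `./datasets/mental_health_clean.sqlite` would replace `:memory:`):

```python
import sqlite3

import pandas as pd

# Write a small frame and read it back, mirroring the to_sql/read_sql
# round-trip used for the clean dataset (in-memory DB keeps this self-contained).
con = sqlite3.connect(":memory:")
frame = pd.DataFrame(
    {"AnswerText": ["37"], "SurveyID": [2014], "UserID": [1], "QuestionID": [1]}
)
frame.to_sql("Answer", con, if_exists="replace", index=False)

tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table'", con
)
roundtrip = pd.read_sql_query("SELECT * FROM Answer", con)
print(tables["name"].tolist())  # ['Answer']
con.close()
```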
OUTCOMES:
prism_palette = px.colors.qualitative.Prism
question_counts = (
processed_dataset.groupby("SurveyID")["QuestionID"].nunique().reset_index()
)
user_counts = processed_dataset.groupby("SurveyID")["UserID"].nunique().reset_index()
fig_question = go.Figure(
data=[
go.Bar(
x=question_counts["SurveyID"], y=question_counts["QuestionID"], width=0.2
)
]
)
fig_question.update_layout(
title="Number of Questions per Survey Year",
xaxis_title="Survey Year",
yaxis_title="Number of Questions",
template="plotly_dark",
)
fig_question.show()
fig_user = go.Figure(
data=[go.Bar(x=user_counts["SurveyID"], y=user_counts["UserID"], width=0.2)]
)
fig_user.update_layout(
title="Number of Users per Survey Year",
xaxis_title="Survey Year",
yaxis_title="Number of Users",
template="plotly_dark",
)
fig_user.show()
OUTCOMES:
Scrutinizing the dataset to identify key patterns, relationships, and trends. This process aids in detecting significant variables and anomalies, leading to more accurate predictions and insights.
Preliminary Plan for Data Exploration
Basic Exploration
Use the describe function to provide an overview of numerical and categorical features in each dataset.
Detailed Exploration
Respondent Characteristics Analysis:
Mental Health Questions Response Analysis:
Relationship Analysis:
print("Processed Dataset")
print(processed_dataset.describe(include=["object"]).T)
Processed Dataset
count unique top freq
AnswerText 236898 4118 -1 56110
QuestionText 236898 105 What is your age? 4218
Description 236898 5 mental health survey for 2016 88238
OUTCOMES:
HYPOTHESIS: In the field of IT, professionals may exhibit significant differences in various aspects such as job roles, experience levels, and backgrounds. This analysis hypothesizes that exploring these differences can enhance our understanding of their diverse responses to the survey topics. The validity of this hypothesis will be tested in the subsequent analysis.
query_age_distribution_tech = """
SELECT SurveyID,
CASE
WHEN AnswerText BETWEEN 18 AND 24 THEN '18-24'
WHEN AnswerText BETWEEN 25 AND 34 THEN '25-34'
WHEN AnswerText BETWEEN 35 AND 44 THEN '35-44'
WHEN AnswerText BETWEEN 45 AND 54 THEN '45-54'
WHEN AnswerText BETWEEN 55 AND 64 THEN '55-64'
WHEN AnswerText BETWEEN 65 AND 99 THEN 'Over 65 '
END AS age_group,
COUNT(*) AS count
FROM Answer
WHERE SurveyID IN (2014, 2016, 2017, 2018, 2019)
AND (QuestionID = 1 OR QuestionID IN (9, 13)) -- Include Question 1, 9, and 13
AND (AnswerText > 18 OR AnswerText IN (-1, 1)) -- Include -1 and 1 for both questions
GROUP BY SurveyID, age_group
ORDER BY SurveyID, age_group;
"""
age_survey_distribution_tech = pd.read_sql(query_age_distribution_tech, cnx2)
max_y_value = 800
fig = px.line(
age_survey_distribution_tech,
x="age_group",
y="count",
color="SurveyID",
labels={"age_group": "Age Group", "count": "Count", "SurveyID": "Survey Year"},
title="Count of Users by Age Group and Tech Employment (Questions 1, 9, and 13)",
category_orders={
"age_group": ["18-24", "25-34", "35-44", "45-54", "55-64", "Over 65 "]
},
color_discrete_sequence=prism_palette,
template="plotly_dark",
)
fig.update_yaxes(range=[0, max_y_value])
fig.show()
query_gender_distribution = """
SELECT SurveyID,
SUM(CASE WHEN AnswerText = 'Male' THEN 1 ELSE 0 END) AS male_count,
SUM(CASE WHEN AnswerText = 'Female' THEN 1 ELSE 0 END) AS female_count,
SUM(CASE WHEN AnswerText = 'LGBTQIA+' THEN 1 ELSE 0 END) AS lgbtqia_count,
SUM(CASE WHEN AnswerText = 'Unknown' THEN 1 ELSE 0 END) AS unknown_count
FROM Answer
WHERE QuestionID == 2 OR (QuestionID == 9 AND AnswerText == 1) OR (QuestionID == 13 AND AnswerText == 1)
GROUP BY SurveyID
"""
df = pd.read_sql(query_gender_distribution, cnx2)
prism_colors = [
"#5f4690",
"#1d6a96",
"#38a6a4",
"#0f8554",
"#73af48",
"#edac08",
"#e17a05",
"#cc513e",
"#94346e",
"#6f4070",
"#666666",
]
df = df.rename(
columns={
"male_count": "Male",
"female_count": "Female",
"lgbtqia_count": "LGBTQIA+",
"unknown_count": "Unknown",
}
)
fig = px.bar(
df,
x="SurveyID",
y=["Male", "Female", "LGBTQIA+", "Unknown"],
labels={"value": "Count", "variable": "Gender"},
title="Gender Distribution in Tech Surveys",
color_discrete_sequence=prism_palette,
template="plotly_dark",
)
fig.update_layout(
barmode="group", xaxis_title="Survey ID", yaxis_title="Count", legend_title="Gender"
)
fig.show()
query_country_distribution = """
SELECT SurveyID, AnswerText AS Country, COUNT(*) AS Employee_count
FROM Answer
WHERE QuestionID = 50
GROUP BY SurveyID, Country
ORDER BY SurveyID, Employee_count DESC
"""
df = pd.read_sql(query_country_distribution, cnx2)
df["Employee_count"] = df["Employee_count"].astype(int)
top_countries_by_survey = (
df.groupby("SurveyID")
.apply(lambda x: x.nlargest(10, "Employee_count"))
.reset_index(drop=True)
)
fig = sp.make_subplots(
rows=1,
cols=1,
subplot_titles=["Top 10 Country of Work Distribution in Tech Surveys"],
)
average_button = dict(
    label="Average",
    method="update",
    # one visibility flag per trace (one trace per survey year)
    args=[{"visible": [True] * top_countries_by_survey["SurveyID"].nunique()}],
)
buttons = [average_button]
for survey_id in top_countries_by_survey["SurveyID"].unique():
button = dict(
label=str(survey_id),
method="update",
args=[
{
"visible": [
survey_id == sid
for sid in top_countries_by_survey["SurveyID"].unique()
]
}
],
)
buttons.append(button)
fig.update_layout(
updatemenus=[
dict(
type="buttons",
showactive=True,
buttons=buttons,
x=1,
xanchor="right",
y=1.15,
yanchor="top",
)
]
)
for survey_id, color in zip(
top_countries_by_survey["SurveyID"].unique(), prism_palette
):
data = top_countries_by_survey[top_countries_by_survey["SurveyID"] == survey_id]
trace = go.Bar(
x=data["Country"],
y=data["Employee_count"],
name=f"Survey {survey_id}",
marker_color=color,
)
fig.add_trace(trace)
fig.update_layout(
xaxis_title="Country",
yaxis_title="Number of Employees",
title_x=0.5,
template="plotly_dark",
)
fig.show()
df["Employee_count"] = df["Employee_count"].astype(int)
total_respondents = df["Employee_count"].sum()
top_countries = df.groupby("Country").sum().nlargest(10, "Employee_count")
for country, count in top_countries.iterrows():
percentage = (count["Employee_count"] / total_respondents) * 100
print(
f"{country}: {count['Employee_count']} respondents ({percentage:.2f}% of total)"
)
United States of America: 1872 respondents (63.29% of total)
United Kingdom: 300 respondents (10.14% of total)
Canada: 122 respondents (4.12% of total)
Germany: 90 respondents (3.04% of total)
Netherlands: 70 respondents (2.37% of total)
Australia: 50 respondents (1.69% of total)
India: 39 respondents (1.32% of total)
France: 34 respondents (1.15% of total)
Brazil: 30 respondents (1.01% of total)
Spain: 24 respondents (0.81% of total)
query_users_participated = """
SELECT UserID, AnswerText AS Country
FROM Answer
WHERE QuestionID = 50
"""
df = pd.read_sql(query_users_participated, cnx2)
total_unique_respondents = df["UserID"].nunique()
df_countries = df.groupby("Country").size().reset_index(name="Employee_count")
top_countries = df_countries.nlargest(4, "Employee_count")
# top_countries has a RangeIndex here, so the country name must be read from
# the "Country" column rather than from the iterrows() index.
for _, row in top_countries.iterrows():
    percentage = (row["Employee_count"] / total_unique_respondents) * 100
    print(
        f"{row['Country']}: {row['Employee_count']} respondents ({percentage:.2f}% of total unique respondents)"
    )
print(f"\nTotal unique respondents: {total_unique_respondents}")
United States of America: 1872 respondents (63.29% of total unique respondents)
United Kingdom: 300 respondents (10.14% of total unique respondents)
Canada: 122 respondents (4.12% of total unique respondents)
Germany: 90 respondents (3.04% of total unique respondents)

Total unique respondents: 2958
query_state_distribution = """
SELECT
year,
CASE WHEN state = -1 THEN 'Unknown' ELSE state END as state,
COUNT(DISTINCT UserID) AS Employee_count,
SurveyID
FROM (
SELECT
UserID,
SurveyID AS year,
MAX(CASE WHEN QuestionID == 4 AND AnswerText != -1 THEN AnswerText END) AS state,
MAX(CASE WHEN QuestionID == 9 THEN AnswerText END) AS Tech_company,
MAX(CASE WHEN QuestionID == 13 THEN AnswerText END) AS Tech_role,
SurveyID
FROM Answer
WHERE
(QuestionID == 4 AND AnswerText != -1)
OR (QuestionID == 9 AND AnswerText == 1)
OR (QuestionID == 13 AND AnswerText == 1)
GROUP BY UserID, SurveyID
HAVING state IS NOT NULL AND (Tech_company IS NOT NULL OR Tech_role IS NOT NULL)
)
GROUP BY year, state, SurveyID
ORDER BY year, Employee_count DESC;
"""
df = pd.read_sql(query_state_distribution, cnx2)
df["Employee_count"] = df["Employee_count"].astype(int)
top_states_by_survey = (
df.groupby("SurveyID")
.apply(lambda x: x.nlargest(10, "Employee_count"))
.reset_index(drop=True)
)
fig = sp.make_subplots(
rows=1,
cols=1,
subplot_titles=["Top 10 US States of Work Distribution in Tech Surveys"],
)
buttons = [
    dict(
        label="Average",
        method="update",
        # one visibility flag per trace (one trace per survey year)
        args=[{"visible": [True] * top_states_by_survey["SurveyID"].nunique()}],
    )
]
for survey_id in top_states_by_survey["SurveyID"].unique():
button = dict(
label=str(survey_id),
method="update",
args=[
{
"visible": [
survey_id == sid
for sid in top_states_by_survey["SurveyID"].unique()
]
}
],
)
buttons.append(button)
fig.update_layout(
updatemenus=[
dict(
type="buttons",
showactive=True,
buttons=buttons,
x=1,
xanchor="right",
y=1.15,
yanchor="top",
)
]
)
for survey_id, color in zip(top_states_by_survey["SurveyID"].unique(), prism_palette):
data = top_states_by_survey[top_states_by_survey["SurveyID"] == survey_id]
trace = go.Bar(
x=data["state"],
y=data["Employee_count"],
name=f"Survey {survey_id}",
marker_color=color,
)
fig.add_trace(trace)
fig.update_layout(
xaxis_title="State",
yaxis_title="Number of Employees",
title_x=0.5,
template="plotly_dark",
)
fig.show()
df = pd.read_sql(query_state_distribution, cnx2)
df = df[df["state"].notna()]
df["Employee_count"] = df["Employee_count"].astype(int)
total_respondents = df["Employee_count"].sum()
top_states = df.groupby("state").sum().nlargest(10, "Employee_count")
for state, count in top_states.iterrows():
percentage = (count["Employee_count"] / total_respondents) * 100
print(
f"{state}: {count['Employee_count']} respondents ({percentage:.2f}% of total)"
)
California: 336 respondents (15.22% of total)
Illinois: 187 respondents (8.47% of total)
Washington: 150 respondents (6.79% of total)
New York: 123 respondents (5.57% of total)
Tennessee: 105 respondents (4.76% of total)
Texas: 98 respondents (4.44% of total)
Michigan: 96 respondents (4.35% of total)
Ohio: 95 respondents (4.30% of total)
Oregon: 84 respondents (3.80% of total)
Minnesota: 82 respondents (3.71% of total)
OUTCOMES:
INSIGHTS:
CAUTION: The data reveals notable disproportions across the IT landscape, raising concerns about underrepresentation in certain groups. Countries and states exhibit uneven participation, with some regions lacking representation while others dominate, introducing potential biases in the analysis of respondents' answers. Recognizing these limitations, the decision has been made to focus testing on gender and age data, where the biases are assumed to be less pronounced compared to countries and states with more diverse and underrepresented groups.
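The geographic skew described above can be put in numbers: the share held by the largest group is a simple concentration measure. A sketch using the country counts printed earlier (values copied from that output):

```python
import pandas as pd

# Respondent counts per country, copied from the breakdown printed above.
counts = pd.Series(
    {
        "United States of America": 1872,
        "United Kingdom": 300,
        "Canada": 122,
        "Germany": 90,
        "Netherlands": 70,
    }
)
total_unique = 2958  # total unique respondents reported above

shares = (counts / total_unique).round(4)
print(shares)
# One country accounting for well over half the sample is a strong
# signal that country-level comparisons would be unreliable.
```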
query_treatment = """
SELECT
Treatment,
SUM(CASE year WHEN 2014 THEN percentage END) AS Y2014,
SUM(CASE year WHEN 2016 THEN percentage END) AS Y2016,
SUM(CASE year WHEN 2017 THEN percentage END) AS Y2017,
SUM(CASE year WHEN 2018 THEN percentage END) AS Y2018,
SUM(CASE year WHEN 2019 THEN percentage END) AS Y2019
FROM (
SELECT year, Treatment,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY year) AS percentage
FROM (
SELECT SurveyID AS year,
MAX(CASE WHEN (QuestionID == 13 AND AnswerText == 1) THEN AnswerText END) AS Tech_company,
MAX(CASE WHEN (QuestionID == 9 AND AnswerText == 1) THEN AnswerText END) AS Tech_role,
MAX(CASE WHEN (QuestionID == 7) THEN AnswerText END) AS Treatment,
COUNT (*) as count
FROM Answer
WHERE QuestionID IN (7, 9, 13)
GROUP BY UserID, SurveyID
HAVING (Tech_company IS NOT NULL OR Tech_role IS NOT NULL)
) AS t
GROUP BY year, Treatment
) AS t2
GROUP BY Treatment
"""
df = pd.read_sql(query_treatment, cnx2)
df["Treatment"] = pd.to_numeric(df["Treatment"])
title = "Seeking Mental Health Treatment in the Past"
xaxis_title = "Treatment Response"
column = "Treatment"
create_bar_plot(df, column, title, xaxis_title)
fig = go.Figure()
query_interview = """
SELECT
Interview,
SUM(CASE year WHEN 2014 THEN percentage END) AS Y2014,
SUM(CASE year WHEN 2016 THEN percentage END) AS Y2016,
SUM(CASE year WHEN 2017 THEN percentage END) AS Y2017,
SUM(CASE year WHEN 2018 THEN percentage END) AS Y2018,
SUM(CASE year WHEN 2019 THEN percentage END) AS Y2019
FROM (
SELECT year, Interview,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY year) AS percentage
FROM (
SELECT SurveyID AS year,
MAX(CASE WHEN (QuestionID == 13 AND AnswerText == 1) THEN AnswerText END) AS Tech_company,
MAX(CASE WHEN (QuestionID == 9 AND AnswerText == 1) THEN AnswerText END) AS Tech_role,
MAX(CASE WHEN (QuestionID == 12) THEN AnswerText END) AS Interview,
COUNT (*) as count
FROM Answer
WHERE QuestionID IN (9, 12, 13)
GROUP BY UserID, SurveyID
HAVING (Tech_company IS NOT NULL OR Tech_role IS NOT NULL)
) AS t
GROUP BY year, Interview
) AS t2
GROUP BY Interview
"""
df = pd.read_sql(query_interview, cnx2)
df["Interview"] = pd.to_numeric(df["Interview"])
title = "Bringing Up Mental Health Issue Over Interview"
xaxis_title = "Interview Response"
column = "Interview"
create_bar_plot(df, column, title, xaxis_title)
fig = go.Figure()
query_discussing = """
SELECT
MAX(Discussing) AS Discussing,
SUM(CASE year WHEN 2014 THEN percentage END) AS Y2014,
SUM(CASE year WHEN 2016 THEN percentage END) AS Y2016,
SUM(CASE year WHEN 2017 THEN percentage END) AS Y2017,
SUM(CASE year WHEN 2018 THEN percentage END) AS Y2018,
SUM(CASE year WHEN 2019 THEN percentage END) AS Y2019
FROM (
SELECT year, MAX(Discussing) AS Discussing,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY year) AS percentage
FROM (
SELECT SurveyID AS year,
MAX(CASE WHEN (QuestionID == 13 AND AnswerText == 1) THEN AnswerText END) AS Tech_company,
MAX(CASE WHEN (QuestionID == 9 AND AnswerText == 1) THEN AnswerText END) AS Tech_role,
MAX(CASE WHEN (QuestionID == 98 OR QuestionID == 28) THEN AnswerText END) AS Discussing,
COUNT (*) as count
FROM Answer
WHERE QuestionID IN (9, 13, 98, 28)
GROUP BY UserID, SurveyID
HAVING (Tech_company IS NOT NULL OR Tech_role IS NOT NULL)
) AS t
GROUP BY year, Discussing
) AS t2
GROUP BY Discussing
"""
df = pd.read_sql(query_discussing, cnx2)
df["Discussing"] = clean_to_numerical(df, "Discussing")
df["Discussing"] = pd.to_numeric(df["Discussing"], errors="coerce")
df_grouped = df.groupby("Discussing").sum().reset_index()
df_grouped = df_grouped.dropna(subset=["Discussing"])
display(df_grouped)
title = "Fear of Negative Consequences in Mental Health Discussions"
xaxis_title = "Discussing Response"
column = "Discussing"
create_bar_plot(df_grouped, column, title, xaxis_title)
| | Discussing | Y2014 | Y2016 | Y2017 | Y2018 | Y2019 |
|---|---|---|---|---|---|---|
| 0 | -1.0 | 0.000000 | 17.860301 | 17.610063 | 20.448179 | 21.694915 |
| 1 | 0.0 | 41.707081 | 6.896552 | 6.446541 | 6.442577 | 4.406780 |
| 2 | 0.5 | 36.469447 | 45.092838 | 44.025157 | 43.137255 | 41.016949 |
| 3 | 1.0 | 21.823472 | 30.150309 | 31.918239 | 29.971989 | 32.881356 |
fig = go.Figure()
query_handling = """
SELECT
Handling,
SUM(CASE year WHEN 2014 THEN percentage END) AS Y2014,
SUM(CASE year WHEN 2016 THEN percentage END) AS Y2016,
SUM(CASE year WHEN 2017 THEN percentage END) AS Y2017,
SUM(CASE year WHEN 2018 THEN percentage END) AS Y2018,
SUM(CASE year WHEN 2019 THEN percentage END) AS Y2019
FROM (
SELECT year, Handling,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY year) AS percentage
FROM (
SELECT SurveyID AS year,
MAX(CASE WHEN (QuestionID == 13 AND AnswerText == 1) THEN AnswerText END) AS Tech_company,
MAX(CASE WHEN (QuestionID == 9 AND AnswerText == 1) THEN AnswerText END) AS Tech_role,
MAX(CASE WHEN (QuestionID == 56 OR QuestionID == 102) THEN AnswerText END) AS Handling,
COUNT (*) as count
FROM Answer
WHERE QuestionID IN (9, 13, 56, 102)
GROUP BY UserID, SurveyID
HAVING (Tech_company IS NOT NULL OR Tech_role IS NOT NULL)
) AS t
GROUP BY year, Handling
) AS t2
GROUP BY Handling
"""
df = pd.read_sql(query_handling, cnx2)
df["Handling"] = pd.to_numeric(df["Handling"])
title = "Witnessed Poor Handling of Mental Health Issues at Work"
xaxis_title = "Handling Response"
column = "Handling"
create_bar_plot(df, column, title, xaxis_title)
OUTCOMES:
INSIGHTS:
filtered_db_path = "./datasets/mental_health_clean_filtered.sqlite"
valid_question_ids = [1, 2, 4, 7, 9, 12, 13, 28, 50, 56, 98, 102]
df_combined = database_filtering(
processed_dataset, valid_question_ids, filtered_db_path
)
Columns in create_country_subquery: Index(['UserID', 'Country'], dtype='object')
Columns in create_state_subquery: Index(['UserID', 'State'], dtype='object')
Columns in create_treatment_subquery: Index(['UserID', 'Treatment'], dtype='object')
Columns in create_interview_subquery: Index(['UserID', 'Interview'], dtype='object')
print(f"{df_combined['Age'].unique()}\n")
print(f"{df_combined['Gender'].unique()}\n")
print(f"{df_combined['Treatment'].unique()}\n")
print(f"{df_combined['Interview'].unique()}\n")
print(f"{df_combined['Discussing'].unique()}\n")
print(f"{df_combined['Handling'].unique()}\n")
[37 32 31 33 35 39 42 23 29 36 27 46 34 30 40 38 50 24 18 28 26 22 19 25
 44 45 43 41 21 56 60 54 55 48 20 57 58 47 62 51 49 53 61 52 63 66 59 65
 70 64 67]

['Female' 'Male' 'LGBTQIA+' 'Unknown']

[1 0]

[0.0 1.0 0.5]

[0.0 1.0 0.5 -1.0]

[0.0 1.0 0.5 -1.0]
point_biserial_corr, p_value = pointbiserialr(
df_combined["Treatment"], df_combined["Age"]
)
print(
f"Point-Biserial Correlation: {point_biserial_corr}, p-value: {p_value} for Age vs Treatment ('Seeking Mental Health Treatment in the Past')"
)
Point-Biserial Correlation: 0.06802000977688193, p-value: 1.8625509518119982e-05 for Age vs Treatment ('Seeking Mental Health Treatment in the Past')
spearman_corr, p_value = spearmanr(
df_combined["Age"], df_combined["Interview"].astype(float)
)
print(
f"Spearman Rank Correlation: {spearman_corr}, p-value: {p_value} for Age vs Interview ('Bringing Up Mental Health Issue Over Interview')"
)
Spearman Rank Correlation: 0.0013022290674804377, p-value: 0.9347585190895162 for Age vs Interview ('Bringing Up Mental Health Issue Over Interview')
spearman_corr, p_value = spearmanr(df_combined["Age"], df_combined["Discussing"])
print(
f"Spearman Rank Correlation: {spearman_corr}, p-value: {p_value} for Age vs Discussing (Fear of Negative Consequences in Mental Health Discussions)"
)
Spearman Rank Correlation: 0.13468394108357834, p-value: 1.8137854566051697e-17 for Age vs Discussing (Fear of Negative Consequences in Mental Health Discussions)
spearman_corr, p_value = spearmanr(df_combined["Age"], df_combined["Handling"])
print(
f"Spearman Rank Correlation: {spearman_corr}, p-value: {p_value} for Age vs Handling (Witnessed Poor Handling of Mental Health Issues at Work)"
)
Spearman Rank Correlation: 0.11947128064109927, p-value: 4.8058393227034167e-14 for Age vs Handling (Witnessed Poor Handling of Mental Health Issues at Work)
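Four Age correlations are tested above, so the family-wise error rate is worth a look. A sketch applying Holm's step-down correction to the reported p-values (statsmodels is already imported elsewhere in this notebook; the p-values are copied from the outputs above):

```python
from statsmodels.stats.multitest import multipletests

# p-values from the four Age tests above, in order:
# Treatment, Interview, Discussing, Handling.
names = ["Treatment", "Interview", "Discussing", "Handling"]
pvals = [1.86e-05, 0.935, 1.81e-17, 4.81e-14]

# Holm's method controls the family-wise error rate at alpha = 0.05.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for name, p, r in zip(names, p_adj, reject):
    print(f"{name}: adjusted p = {p:.2e}, significant = {r}")
```

The Interview result stays non-significant; the other three survive the correction, so the conclusions above are not artifacts of multiple testing.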
male_treatment = df_combined[df_combined["Gender"] == "Male"]["Treatment"]
female_treatment = df_combined[df_combined["Gender"] == "Female"]["Treatment"]
lgbtqia_treatment = df_combined[df_combined["Gender"] == "LGBTQIA+"]["Treatment"]
unknown_treatment = df_combined[df_combined["Gender"] == "Unknown"]["Treatment"]
f_statistic, p_value = f_oneway(
male_treatment, female_treatment, lgbtqia_treatment, unknown_treatment
)
print(
f"ANOVA F-statistic: {f_statistic}, p-value: {p_value} for Gender vs Treatment ('Seeking Mental Health Treatment in the Past')"
)
ANOVA F-statistic: 56.47830940124604, p-value: 9.657269285296993e-36 for Gender vs Treatment ('Seeking Mental Health Treatment in the Past')
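The F statistic says the group means differ, but not by how much. Eta-squared gives a rough effect size; the respondent total is not printed above, so a ~4000-observation figure is assumed purely for illustration:

```python
def eta_squared(f_stat: float, df_between: int, df_within: int) -> float:
    """Effect size from a one-way ANOVA F statistic."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)

# F from the ANOVA above; 4 gender groups -> df_between = 3.
# df_within assumes roughly 4000 respondents (an illustrative guess).
effect = eta_squared(56.478, 3, 3996)
print(round(effect, 3))  # ~0.04: a small effect despite the tiny p-value
```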
tukey_results = pairwise_tukeyhsd(df_combined["Treatment"], df_combined["Gender"])
print(tukey_results)
Multiple Comparison of Means - Tukey HSD, FWER=0.05
========================================================
group1 group2 meandiff p-adj lower upper reject
--------------------------------------------------------
Female LGBTQIA+ 0.083 0.3787 -0.0503 0.2164 False
Female Male -0.2173 0.0 -0.2639 -0.1707 True
Female Unknown -0.0545 0.8985 -0.2555 0.1465 False
LGBTQIA+ Male -0.3003 0.0 -0.4295 -0.1711 True
LGBTQIA+ Unknown -0.1375 0.4328 -0.3719 0.0969 False
Male Unknown 0.1628 0.15 -0.0355 0.3611 False
--------------------------------------------------------
df_combined["Gender_Code"] = df_combined["Gender"].map(
{"Male": 1, "Unknown": 2, "Female": 3, "LGBTQIA+": 4}
)
mean_values = df_combined.groupby("Gender_Code")["Treatment"].mean()
x_values = df_combined["Gender_Code"]
y_values = df_combined["Treatment"]
m, b = np.polyfit(x_values, y_values, deg=1)
y_pred = m * x_values + b
r_squared = r2_score(y_values, y_pred)
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=df_combined["Gender_Code"],
y=df_combined["Treatment"],
mode="markers",
name="Data Points",
text=df_combined["Gender_Code"],
)
)
fig.add_trace(
go.Scatter(
x=df_combined["Gender_Code"],
y=m * df_combined["Gender_Code"] + b,
mode="lines",
name="Regression Line",
text=[f"Y = {m:.2f}X + {b:.2f}" for _ in df_combined["Gender_Code"]],
)
)
fig.update_layout(
hovermode="closest",
hoverlabel=dict(bgcolor="white", bordercolor="black"),
xaxis_title="Gender Code",
yaxis_title="Treatment",
title="Scatter Plot with Regression Line",
template="plotly_dark",
)
fig.show()
contingency_table = pd.crosstab(df_combined["Interview"], df_combined["Gender"])
chi2, p, _, _ = chi2_contingency(contingency_table)
n = df_combined.shape[0]
categories_gender = len(df_combined["Gender"].unique())
categories_interview = len(df_combined["Interview"].unique())
cramers_v = np.sqrt(chi2 / (n * (min(categories_gender, categories_interview) - 1)))
print(
f"Chi-square value: {chi2}, p-value: {p} for Gender vs Interview ('Bringing Up Mental Health Issue Over Interview')"
)
print(f"Cramér's V: {cramers_v}")
Chi-square value: 89.0708561088282, p-value: 4.7248453716916386e-17 for Gender vs Interview ('Bringing Up Mental Health Issue Over Interview')
Cramér's V: 0.10612910040159637
observed = pd.crosstab(df_combined["Gender"], df_combined["Discussing"])
chi2, p, _, _ = chi2_contingency(observed)
n = observed.sum().sum()
phi2 = chi2 / n
cramers_v = sqrt(phi2 / min(observed.shape[0] - 1, observed.shape[1] - 1))
print(
f"Chi-square value: {chi2}, p-value: {p} for Gender vs Discussing (Fear of Negative Consequences in Mental Health Discussions)"
)
print(f"Cramér's V: {cramers_v}")
Chi-square value: 61.436824213018426, p-value: 7.080333712035073e-10 for Gender vs Discussing (Fear of Negative Consequences in Mental Health Discussions)
Cramér's V: 0.07196733474074374
observed = pd.crosstab(df_combined["Gender"], df_combined["Handling"])
chi2, p, _, _ = chi2_contingency(observed)
n = observed.sum().sum()
phi2 = chi2 / n
cramers_v = sqrt(phi2 / min(observed.shape[0] - 1, observed.shape[1] - 1))
print(
f"Chi-square value: {chi2}, p-value: {p} for Gender vs Handling (Witnessed Poor Handling of Mental Health Issues at Work)"
)
print(f"Cramér's V: {cramers_v}")
Chi-square value: 100.70542639162835, p-value: 1.1328077252907465e-17 for Gender vs Handling (Witnessed Poor Handling of Mental Health Issues at Work)
Cramér's V: 0.09213984481481263
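The Cramér's V arithmetic above is repeated inline for each pair; factoring it into a helper keeps the three tests consistent. A sketch (not part of the original `utils.functions` module):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V for two categorical series (plain, uncorrected version)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # min(rows, cols) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Toy example: a perfectly associated 3x3 pair gives V = 1.
a = pd.Series(["x", "x", "x", "y", "y", "y", "z", "z", "z"])
b = pd.Series(["p", "p", "p", "q", "q", "q", "r", "r", "r"])
print(round(cramers_v(a, b), 3))  # 1.0
```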
cnx.close()
cnx2.close()
OUTCOMES:
INSIGHTS:
SUMMARY:
Distribution of Respondent Characteristics
Treatment Variation Across Genders:
Impact of Age on Perceptions:
Gender's Influence on Mental Health Discussions:
INSIGHTS:
The correlation analysis showed only a negligible relationship between 'Age' and 'Seeking Mental Health Treatment in the Past': the point-biserial correlation was statistically significant (p < 0.05) but very weak (r ≈ 0.068). For 'Age' and 'Bringing Up Mental Health Issue Over Interview,' no significant correlation was found (p ≈ 0.93).
A statistically significant weak positive correlation exists between 'Age' and the likelihood of witnessing poor handling of mental health issues at work: older individuals may have a slightly higher tendency to report witnessing such incidents.
A statistically significant weak positive correlation exists between 'Age' and fear of negative consequences when discussing mental health with a boss: older individuals may be slightly more inclined to perceive such discussions negatively.
Significant differences were found in the 'Treatment' means between the Male and Female groups, based on the ANOVA and the post-hoc Tukey HSD test. The regression analysis suggests a positive association between 'Gender_Code' and the likelihood of having sought mental health treatment in the past. However, the model's low R² indicates limited explanatory power (and 'Gender_Code' is a nominal variable coded as numbers), so caution is advised in drawing strong conclusions from this analysis.
The chi-square test revealed a statistically significant association between 'Gender' and 'Bringing Up Mental Health Issue Over Interview.' However, Cramér's V (0.1061) indicates a relatively weak practical relationship. While evidence of a connection exists, the effect size is modest, suggesting other factors may also influence participants' responses.
The chi-square test suggests a significant association between 'Gender' and the likelihood of perceiving discussing mental health with a boss as having negative consequences. Cramér's V (0.0720) indicates a small effect size, suggesting a weak association between gender and the perception of discussing mental health.
There is a significant association between gender and the likelihood of witnessing poor handling of mental health issues at work. Cramér's V (0.0921) reflects a small to modest effect size, indicating a weak association between gender and the perception of poor handling of mental health issues.
POTENTIAL AREAS FOR INVESTIGATION:
Observations Relevant to Clients:
Observations Relevant to Analysts (processing would require more data):
RECOMMENDATIONS: